<h1 align='center'>
A Brief Introduction to NLP and the Python NLTK library
</h1>
<br>

<h2 align='center'>
<div>Sam Carton</div>

<div>School of Information</div>
</h2>

<h2>NLP: Natural Language Processing</h2>

Broadly, NLP is the practice of building useful numerical representations of text in order to perform tasks such as:
<ul>
<li>Classification</li>
<li>Clustering</li>
<li>Scientific analysis (i.e., relating quantities derived from text)</li>
<li>Retrieval</li>
<li>Matching</li>
<li>Lots of other stuff</li>

</ul>



<h2>Variable-length text to fixed-length numbers/vectors (mostly)</h2>

"I love coffee" --> [1 0 1 1 0]

"I hate coffee" --> [1 1 1 0 0]

"I hate that I love coffee" --> [2 1 1 1 1]

"I love that I hate coffee" --> [2 1 1 1 1]

Each position counts one word, in the order: I, hate, coffee, love, that.


<h2>NLTK: The most popular general-purpose Python NLP library</h2>


<h2>I. Tokenization and bag-of-words representations</h2>

In [6]:
# Almost always the first step: tokenization
import nltk

sentence = "I love coffee"

tokens = nltk.word_tokenize(sentence)

print(tokens)


['I', 'love', 'coffee']


In [17]:
# Easiest way to convert from tokens to vectors: bag-of-words representation
sentences = ["I love coffee",
             "I hate coffee",
             "I hate that I love coffee",
             "I love that I hate coffee"]

#Step 1: tokenize all documents
sentence_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]

#Step 2: Construct the vocabulary
all_tokens = []
for token_list in sentence_tokens:
    for token in token_list:
        all_tokens.append(token)
        
print("All tokens: {}".format(all_tokens))

vocabulary = sorted(list(set(all_tokens)))
print("Vocabulary (sorted and de-duplicated): {}".format(vocabulary))


All tokens: ['I', 'love', 'coffee', 'I', 'hate', 'coffee', 'I', 'hate', 'that', 'I', 'love', 'coffee', 'I', 'love', 'that', 'I', 'hate', 'coffee']
Vocabulary (sorted and de-duplicated): ['I', 'coffee', 'hate', 'love', 'that']


In [16]:
#Step 3: Build one vector per document, as long as the vocabulary, counting how many times
#each vocabulary word occurs in that document (one column per word)

sentence_vectors = []
for token_list in sentence_tokens:
    sentence_vector = [0]*len(vocabulary)
    for token in token_list:
        word_index = vocabulary.index(token)
        sentence_vector[word_index] += 1
    sentence_vectors.append(sentence_vector)
    
#We should now have an appropriate vector for each sentence
print('Vocabulary: {}'.format(vocabulary))
print('Sentence vectors:')
for sentence,vector in zip(sentences, sentence_vectors):
    print('\t"{}": {}'.format(sentence,vector))

Vocabulary: ['I', 'coffee', 'hate', 'love', 'that']
Sentence vectors:
	"I love coffee": [1, 1, 0, 1, 0]
	"I hate coffee": [1, 1, 1, 0, 0]
	"I hate that I love coffee": [2, 1, 1, 1, 1]
	"I love that I hate coffee": [2, 1, 1, 1, 1]


In [49]:
#The scikit-learn library provides CountVectorizer, which does all of the above for you
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase=False,vocabulary=vocabulary,tokenizer=nltk.word_tokenize)
matrix = vectorizer.fit_transform(sentences)
print(matrix.todense())


[[1 1 0 1 0]
 [1 1 1 0 0]
 [2 1 1 1 1]
 [2 1 1 1 1]]
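
Note that the last two rows of the matrix are identical: a bag-of-words vector records how often each word occurs, not where. The sketch below makes this explicit using only the standard library. It assumes whitespace splitting as a stand-in for `nltk.word_tokenize`, which only works here because the sentences contain no punctuation.

```python
from collections import Counter

sentences = ["I love coffee",
             "I hate coffee",
             "I hate that I love coffee",
             "I love that I hate coffee"]

# Whitespace splitting stands in for nltk.word_tokenize here (an assumption
# that only holds because these sentences contain no punctuation).
token_lists = [s.split() for s in sentences]
vocabulary = sorted({token for tokens in token_lists for token in tokens})

def to_count_vector(tokens):
    """One count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(tokens) for tokens in token_lists]
for sentence, vector in zip(sentences, vectors):
    print(sentence, "-->", vector)
print("Last two identical:", vectors[2] == vectors[3])
```

The two longer sentences express opposite sentiments but receive the same vector, which is the main limitation of this representation.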


<h2>II. What to do with these vector representations?</h2>

We're going to do text classification. In particular, we're trying to predict whether comments on the /r/politics subreddit will get more or fewer upvotes than average based on the text of the comment.

In [None]:
#Ignore these functions for now
import csv
import numpy as np  # needed by run_ML_experiment to rank coefficients
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
def run_ML_experiment(train_comments, train_labels, test_comments, test_labels, process_function = None):
    '''
    Run a machine learning experiment on the provided data. Generate feature representations of input texts
    by passing them through the provided function.


    :param train_comments: a list of text comments
    :param train_labels: a list of numeric 0 or 1 labels
    :param test_comments: list of text comments
    :param test_labels: list of numeric 0 or 1 labels
    :param process_function: optional function that takes a comment and returns a list of string features
    :return: None (results are printed)
    '''

    if process_function:
        print('\nDoing text classification experiment using {} as the processing function for the text'.format(process_function.__name__))
    else:
        print('\nDoing text classification experiment without any processing function')

    if process_function:
        train_vals = [' '.join(process_function(comment)) for comment in train_comments]
    else:
        train_vals = train_comments

    vectorizer = CountVectorizer(lowercase=False)
    train_X = vectorizer.fit_transform(train_vals)

    if process_function:
        test_vals = [' '.join(process_function(comment)) for comment in test_comments]
    else:
        test_vals = test_comments

    test_X = vectorizer.transform(test_vals)


    model = LogisticRegression()
    model.fit(train_X, train_labels)

    predicted_test_labels = model.predict(test_X)

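    acc = accuracy_score(test_labels, predicted_test_labels)
    # Note: acc is a fraction between 0 and 1 (0.766 means 76.6%), despite the trailing '%' printed below
    print('Test accuracy score: {:.3f}%'.format(acc))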

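    # Indices of the 5 coefficients with the largest absolute value
    top_coef_args = np.argsort(np.abs(model.coef_[0]))[-5:]
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
    feature_names = vectorizer.get_feature_names_out()
    top_features = [feature_names[x] for x in top_coef_args]
    top_coefs = model.coef_[0,top_coef_args]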

    print('Top 5 most decisive features:')
    for feature, coef in reversed(list(zip(top_features,top_coefs))):
        print('\t{}: {:.3f}'.format(feature, coef))

    print('\n')

In [56]:
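def read_csv(filename, comment_column, target_column, filter_values = None, max_rows = 1000):
    '''
    Read up to max_rows (comment, label) pairs from a CSV file, keeping only rows
    whose columns match every key/value pair in filter_values.
    '''
    filter_values = filter_values or {}  # avoid a mutable default argument
    print('Reading data from {}'.format(filename))
    comments = []
    labels = []
    with open(filename, encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            include= True
            for k,v in filter_values.items():
                if k not in row or row[k] != v:
                    include = False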
            if include:
                comments.append(row[comment_column])
                labels.append(int(row[target_column]))

            if len(comments) >= max_rows:
                break

    return comments, labels

In [95]:
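#Read in the train and test sets, then binarize the scores:
#1 if a comment's score is at or above the training-set mean, 0 otherwise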
import numpy as np
train_comments, train_labels = read_csv('aww_politics_comments_late_2012_train.csv','body','score', {'subreddit':'politics'})
mean_upvotes = np.mean(train_labels)
train_labels= [0 if int(x) < mean_upvotes else 1 for x in train_labels]

test_comments, test_labels = read_csv('aww_politics_comments_late_2012_test.csv','body','score', {'subreddit':'politics'})
test_labels = [0 if int(x) < mean_upvotes else 1 for x in test_labels]


print('Comments and labels')
for comment, label in list(zip(train_comments, train_labels))[0:5]:
    print('\t{}: "{}"'.format(label, comment[0:100]+'...'))

Reading data from aww_politics_comments_late_2012_train.csv
Reading data from aww_politics_comments_late_2012_test.csv
Comments and labels
	0: "It shows how low the Republican party is when 4 years before the next election Chris Christie is (as..."
	0: "What programs are those? I have no insurance, can't afford my own plan and my employer doesn't offer..."
	0: "[deleted]..."
	0: "Because Papa Johns going out of business = you get health insurance...... how about the fact that 16..."
	0: "Ehrmagerd! Look what she's doing to get out of talking about Benghazi! ..."


In [96]:
    
#Try running a machine learning experiment 
run_ML_experiment(train_comments, train_labels, test_comments, test_labels, process_function = nltk.word_tokenize)


Doing text classification experiment using word_tokenize as the processing function for the text
Test accuracy score: 0.766%
Top 5 most decisive features:
	please: 1.287
	Someone: 0.974
	correct: 0.966
	wow: 0.860
	Republican: 0.823
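
When reading an accuracy figure like the one above, it helps to compare it with the accuracy of always predicting the most common label. A few highly upvoted comments can pull the mean up, so most comments may fall below it and that baseline can be high. The sketch below shows the calculation on made-up labels (`example_labels` is hypothetical, not the notebook's data).

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Hypothetical labels for illustration only: 8 below average, 2 above average
example_labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
label, baseline = majority_baseline_accuracy(example_labels)
print("Always predicting {} gives accuracy {:.3f}".format(label, baseline))
```

With the real data, one would pass `test_labels` to this function and compare the result with the test accuracy reported by each experiment.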




<h2>III. Stemming</h2>

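Stemming strips suffixes so that different inflected forms of a word, such as "running" and "runs", map to the same token. Irregular forms such as "ran" are not caught.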

In [93]:
stemmer = nltk.stem.SnowballStemmer('english')

words = ['run','ran','running','runner','runs','runt']

for word in words:
    print('Word: "{}"; Stemmed: "{}"'.format(word, stemmer.stem(word)))


Word: "run"; Stemmed: "run"
Word: "ran"; Stemmed: "ran"
Word: "running"; Stemmed: "run"
Word: "runner"; Stemmed: "runner"
Word: "runs"; Stemmed: "run"
Word: "runt"; Stemmed: "runt"
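
To see what this means for the features a classifier receives, the sketch below groups the six words by the stems printed above. The pairs are copied from that output rather than recomputed, so the snippet runs without NLTK.

```python
from collections import defaultdict

# (word, stem) pairs copied from the SnowballStemmer output above
stemmed = [("run", "run"), ("ran", "ran"), ("running", "run"),
           ("runner", "runner"), ("runs", "run"), ("runt", "runt")]

# Group the original words by the stem they share
groups = defaultdict(list)
for word, stem in stemmed:
    groups[stem].append(word)

for stem, words in groups.items():
    print("{}: {}".format(stem, words))
```

Three surface forms collapse into the single feature `run`, while `ran`, `runner`, and the unrelated `runt` each keep a stem of their own.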


In [94]:
#Let's try incorporating it into our classification function

def lowercase_and_stem(comment):
    tokens = nltk.word_tokenize(comment)
    tokens = [x.lower() for x in tokens]
    tokens = [stemmer.stem(x) for x in tokens]
    return tokens

comment = "I used to love coffee, but now I don't"

print()
print(comment)
print(lowercase_and_stem(comment))

run_ML_experiment(train_comments, train_labels, test_comments, test_labels, process_function = lowercase_and_stem)


I used to love coffee, but now I don't
['i', 'use', 'to', 'love', 'coffe', ',', 'but', 'now', 'i', 'do', "n't"]

Doing text classification experiment using lowercase_and_stem as the processing function for the text
Test accuracy score: 0.751%
Top 5 most decisive features:
	pleas: 1.593
	republican: 1.080
	bullshit: 1.062
	correct: 0.987
	one: 0.923




<h2>IV. Part-of-speech tagging</h2>
Sometimes it can be useful to look at the parts of speech that are used, rather than the words themselves

In [70]:
#For more sophisticated functions like POS tagging, you have to download extra resources (here, a pre-trained tagger model)
nltk.download('averaged_perceptron_tagger',download_dir='./nltk_data')

comment = 'This is an incisive comment about coffee'
tokens = nltk.word_tokenize(comment)
print('Tokens: {}'.format(tokens))
tags = nltk.pos_tag(tokens)
print('Part-of-speech tags: {}'.format(tags))
print('Tags only: {}'.format([x[1] for x in tags]))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     ./nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
Tokens: ['This', 'is', 'an', 'incisive', 'comment', 'about', 'coffee']
Part-of-speech tags: [('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('incisive', 'JJ'), ('comment', 'NN'), ('about', 'IN'), ('coffee', 'NN')]
Tags only: ['DT', 'VBZ', 'DT', 'JJ', 'NN', 'IN', 'NN']
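
Feeding only the tags to a bag-of-words model amounts to counting how often each part of speech occurs in a comment. A small sketch of that count, with the tagged pairs copied from the output above so that it runs without NLTK:

```python
from collections import Counter

# (token, tag) pairs copied from the nltk.pos_tag output above
tagged = [("This", "DT"), ("is", "VBZ"), ("an", "DT"), ("incisive", "JJ"),
          ("comment", "NN"), ("about", "IN"), ("coffee", "NN")]

# Discard the words and count each tag
tag_counts = Counter(tag for _, tag in tagged)
print(sorted(tag_counts.items()))
```

The words themselves are gone; what remains is a profile of the comment's grammatical makeup.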


In [72]:
#Let's try using this in the classification

def pos_tag(comment):
    tokens = nltk.word_tokenize(comment)
    tagged = nltk.pos_tag(tokens)
    return [x[1] for x in tagged]

comment = 'This is yet another trenchant comment about coffee'
print(pos_tag(comment))

run_ML_experiment(train_comments, train_labels, test_comments, test_labels, process_function=pos_tag)


['DT', 'VBZ', 'RB', 'DT', 'JJ', 'NN', 'IN', 'NN']

Doing text classification experiment using pos_tag as the processing function for the text
Test accuracy score: 0.787%
Top 5 most decisive features:
	SYM: 0.861
	RBS: 0.589
	PDT: 0.512
	FW: -0.487
	NNPS: 0.391




<h2>V. Named-entity recognition</h2>

Named entities such as people and organizations can be recognized with reasonable accuracy

In [77]:
comment = "Vienna had some of Europe's earliest coffee shops after its near-conquest by the Ottoman army of Mustafa Pasha"

tokens = nltk.word_tokenize(comment)
tags = nltk.pos_tag(tokens)
entity_tree = nltk.chunk.ne_chunk(tags)

entity_types = ['FACILITY', 'GPE', 'GSP', 'LOCATION', 'ORGANIZATION', 'PERSON']

entity_subtrees = [x for x in entity_tree.subtrees() if x.label() in entity_types]
entity_strings = [' '.join([y[0] for y in subtree.leaves()]) for subtree in entity_subtrees]

print (entity_strings)

['Vienna', 'Europe', 'Ottoman', 'Mustafa Pasha']


In [79]:
#Let's try incorporating this into our classification process

entity_types = ['FACILITY', 'GPE', 'GSP', 'LOCATION', 'ORGANIZATION', 'PERSON']
def get_named_entities(comment):
    tokens = nltk.word_tokenize(comment)
    tags = nltk.pos_tag(tokens)
    entity_tree = nltk.chunk.ne_chunk(tags)
    entity_subtrees = [x for x in entity_tree.subtrees() if x.label() in entity_types]
    entity_strings = [' '.join([y[0] for y in subtree.leaves()]) for subtree in entity_subtrees]
    return entity_strings

comment = 'Jeff Daniels made a donation to Red Cross International on behalf of the United States'

print(get_named_entities(comment))

run_ML_experiment(train_comments, train_labels, test_comments, test_labels, process_function=get_named_entities)


['Jeff', 'Daniels', 'Red Cross International', 'United States']

Doing text classification experiment using get_named_entities as the processing function for the text
Test accuracy score: 0.794%
Top 5 most decisive features:
	Someone: 1.030
	House: -0.995
	Democratic: 0.935
	Dem: 0.929
	Ohio: 0.888




<h2> VI. Bigrams </h2>
Sometimes you want to look at pairs of adjacent tokens instead of just individual tokens


In [86]:
comment = 'This is yet another comment about coffee'

print (nltk.word_tokenize(comment))

print (list(nltk.bigrams(nltk.word_tokenize(comment))))


['This', 'is', 'yet', 'another', 'comment', 'about', 'coffee']
[('This', 'is'), ('is', 'yet'), ('yet', 'another'), ('another', 'comment'), ('comment', 'about'), ('about', 'coffee')]
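
The same idea extends from pairs to runs of any length (n-grams). A minimal standard-library sketch, where `ngrams` is a helper name chosen here for illustration:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, so only complete runs are produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["This", "is", "yet", "another", "comment", "about", "coffee"]
print(ngrams(tokens, 2))  # the same pairs that nltk.bigrams produced above
print(ngrams(tokens, 3))
```

NLTK provides this directly as `nltk.ngrams(tokens, n)`. Longer n-grams capture more context but make the vocabulary grow quickly, so each individual feature is seen less often.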


In [91]:
#Bigramify a couple of our earlier functions and test them out in the classifier
def bigram_lowercase_and_stem(comment):
    tokens = nltk.word_tokenize(comment)
    tokens = [x.lower() for x in tokens]
    tokens = [stemmer.stem(x) for x in tokens]
    return  ['_'.join(x) for x in list(nltk.bigrams(tokens))]

comment = 'Seriously I really need coffee right now'
print (bigram_lowercase_and_stem(comment))

run_ML_experiment(train_comments, train_labels, test_comments, test_labels, process_function=bigram_lowercase_and_stem)


['serious_i', 'i_realli', 'realli_need', 'need_coffe', 'coffe_right', 'right_now']

Doing text classification experiment using bigram_lowercase_and_stem as the processing function for the text
Test accuracy score: 0.788%
Top 5 most decisive features:
	bullshit_: 0.771
	when_you: 0.667
	good_guy: 0.649
	guy_finland: 0.649
	just_did: 0.622




In [92]:
def bigram_pos_tag(comment):
    tokens = nltk.word_tokenize(comment)
    tagged = nltk.pos_tag(tokens)
    return ['_'.join(y) for y in list(nltk.bigrams([x[1] for x in tagged]))]

comment = 'I am so tired right now'
print(bigram_pos_tag(comment))

run_ML_experiment(train_comments, train_labels, test_comments, test_labels, process_function=bigram_pos_tag)

['PRP_VBP', 'VBP_RB', 'RB_JJ', 'JJ_NN', 'NN_RB']

Doing text classification experiment using bigram_pos_tag as the processing function for the text
Test accuracy score: 0.738%
Top 5 most decisive features:
	IN_VBN: 1.435
	TO_NNS: 1.344
	TO_: 1.258
	UH_NN: 1.257
	CC_NNS: -1.202




<h2>VII. Conclusion: No clear best representation</h2>

The three best representations were unigrams of named entities (0.794), bigrams of stemmed words (0.788), and unigrams of part-of-speech tags (0.787). All six accuracies fall between 0.738 and 0.794, so there is no clear lesson here about what causes upvotes and downvotes; more investigation would be needed.
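
For reference, the sketch below collects the test accuracies printed by the six experiments above and ranks them. The numbers are copied from the recorded outputs; the short labels are chosen here for readability.

```python
# Test accuracies as printed by the experiments above (fractions of 1)
accuracies = {
    "unigram tokens": 0.766,
    "unigram stems": 0.751,
    "unigram POS tags": 0.787,
    "unigram named entities": 0.794,
    "bigram stems": 0.788,
    "bigram POS tags": 0.738,
}

# Rank the representations from best to worst
ranked = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    print("{:<24}{:.3f}".format(name, acc))

spread = ranked[0][1] - ranked[-1][1]
print("Spread between best and worst: {:.3f}".format(spread))
```

The gap between the best and worst representation is under six percentage points, and each experiment was run on at most 1000 comments per split, so small differences should be read with caution.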

<h2>VIII. Going further: word embeddings</h2>

In the past few years, word embeddings, which map each word to a dense vector learned from a large corpus, have become very popular as a way to build representations of text. They often lead to better downstream results than the count-based representations used above.

https://code.google.com/archive/p/word2vec/
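
One common baseline for using word embeddings in a task like the one above is to average the vectors of a text's words, which again yields a fixed-length representation. The sketch below uses made-up three-dimensional vectors (`toy_embeddings` is hypothetical, not taken from word2vec) purely to illustrate the idea.

```python
import math

# Made-up 3-dimensional vectors for illustration only; real embeddings
# (e.g. from word2vec) typically have hundreds of dimensions.
toy_embeddings = {
    "i": [0.1, 0.0, 0.2],
    "love": [0.9, 0.1, 0.0],
    "adore": [0.8, 0.2, 0.1],
    "hate": [-0.9, 0.1, 0.0],
    "coffee": [0.0, 0.9, 0.3],
}

def embed(text):
    """Average the vectors of the known words in a text."""
    vectors = [toy_embeddings[w] for w in text.lower().split() if w in toy_embeddings]
    if not vectors:
        return [0.0] * 3
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(cosine(embed("I love coffee"), embed("I adore coffee")))
print(cosine(embed("I love coffee"), embed("I hate coffee")))
```

With these toy vectors, "I love coffee" comes out closer to "I adore coffee" than to "I hate coffee", even though "love" and "adore" would occupy unrelated columns in a bag-of-words model.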